Abstract


In this short note, we present an extension of long short-term memory (LSTM)
neural networks that uses a depth gate to connect the memory cells of adjacent layers.
Doing so introduces a linear dependence between lower and upper layer recurrent
units. Importantly, the linear dependence is gated through a gating function, which
we call the depth gate. This gate is a function of the lower layer memory cell, the input
to this layer, and the past memory cell of this layer. We conducted experiments and verified
that this new LSTM architecture improved machine translation and
language modeling performance.


1 Introduction 

Deep neural networks (DNNs) have been successfully applied to many areas, including speech
and vision. On natural language processing tasks, recurrent neural networks (RNNs) [3, 5] are
widely used because of their ability to memorize long-term dependencies.

A typical problem of training deep networks, including RNNs, is gradient diminishing and
explosion. This problem is apparent when training a simple RNN. The long short-term memory
(LSTM) [6, 7] neural network is an extension of the simple RNN [3]. In LSTM, a memory cell has
a linear dependence between its current activity and its past activity. Importantly, a forget gate is used to
modulate the information flow between the past and the current activities. LSTMs also have input
and output gates to modulate their input and output.

Perhaps the introduction of gating functions in [6, 7] is the most significant improvement to
recurrent neural networks [3]. More recently, the Gated Recurrent Unit (GRU) [8] has also adopted the
concept of using gates. LSTMs and GRUs are widely used in many natural language processing
tasks.

To construct a deep neural network, the standard way is to stack many layers of neural networks.
This, however, has the same problem as building simple recurrent networks. The difference here is
that the error signals from the top, instead of from the last time instance, have to be back-propagated
through many layers of nonlinear transformations, and therefore the error signals might be either
diminished or exploded.
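
The shrinking of error signals through stacked nonlinear layers can be illustrated numerically. The sketch below is not from the original note: it assumes, purely for illustration, that every layer is a tanh layer sharing one weight matrix whose singular values all equal `scale`, and the function name `backprop_norms` is hypothetical. With `scale` below one, each backward step multiplies the error norm by at most `scale`, so the signal diminishes geometrically with depth; factors above one in the layer Jacobians lead to growth by the same argument.

```python
import numpy as np

def backprop_norms(scale, depth=20, dim=8, seed=0):
    """Norm of an error signal as it is back-propagated through `depth` tanh layers."""
    rng = np.random.default_rng(seed)
    # Orthogonal matrix times `scale`: every singular value equals `scale`.
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    w = scale * q
    # Forward pass, storing the tanh derivative of each layer.
    h = rng.standard_normal(dim)
    derivs = []
    for _ in range(depth):
        h = np.tanh(w @ h)
        derivs.append(1.0 - h ** 2)  # derivative of tanh at each unit
    # Backward pass: apply the transposed Jacobian of each layer, top to bottom.
    err = np.ones(dim)
    norms = [np.linalg.norm(err)]
    for d in reversed(derivs):
        err = w.T @ (d * err)
        norms.append(np.linalg.norm(err))
    return norms

norms = backprop_norms(scale=0.5)
print(f"error norm at the top: {norms[0]:.3e}, after 20 layers: {norms[-1]:.3e}")
```

A gated linear path between layers, as investigated in this note, adds a term to each layer Jacobian that does not pass through the nonlinearity.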
Figure 1: LSTM


This short note investigates an extension of LSTMs that uses a depth gate to connect memory cells
of the lower and upper layers. We review recurrent neural networks in Sec. 2. Section 3 presents
the extension. Experiments are in Sec. 4. We relate this extension to other works in Sec. 5 and
conclude in Sec. 6.

2 Review of recurrent neural networks 

A recurrent neural network has a hidden state $h_t$ that depends on its past value $h_{t-1}$ recursively;
i.e.,

$$h_t = g(W_{hh} h_{t-1} + W_{xh} x_t) \tag{1}$$

where $g(\cdot)$ is usually a nonlinear function such as tanh, $x_t$ is the input, and $W_{hh}$ and $W_{xh}$ are the weight
matrices.
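
As a minimal sketch of the recursion in Eq. (1), assuming NumPy arrays for the state and weights (the function name `rnn_step` and the toy dimensions are illustrative, not from the original implementation):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hh, W_xh, g=np.tanh):
    """One step of Eq. (1): h_t = g(W_hh h_{t-1} + W_xh x_t)."""
    return g(W_hh @ h_prev + W_xh @ x_t)

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4                      # toy sizes chosen for illustration
W_hh = 0.1 * rng.standard_normal((n_hid, n_hid))
W_xh = 0.1 * rng.standard_normal((n_hid, n_in))

h = np.zeros(n_hid)                     # initial hidden state
for x_t in rng.standard_normal((5, n_in)):
    h = rnn_step(x_t, h, W_hh, W_xh)    # the state is carried across time steps
print(h)
```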


2.1 Long short-term memory (LSTM) 

LSTM was initially proposed in [6, 7] and later modified in [11]. We follow the implementation
in [11], which is illustrated in Fig. 1. LSTM introduces a linear dependence between its memory
cell $c_t$ and its past value $c_{t-1}$. Additionally, LSTM has input and output gates. Specifically, LSTM is written
as


$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1}) \tag{2}$$

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1}) \tag{3}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1}) \tag{4}$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t) \tag{5}$$

$$h_t = o_t \odot \tanh(c_t) \tag{6}$$

where $i_t$, $f_t$ and $o_t$ are the input gate, forget gate and output gate of the LSTM. $h_t$ is the output from the
LSTM. $\sigma(\cdot)$ is the logistic function. $\odot$ denotes element-wise product. In our application of LSTM,
the forget gate and input gate share the same parameters, and the forget gate is computed as $f_t = 1 - i_t$. Note that
bias terms are omitted in the above equations but they are applied by default.
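
The LSTM update in Eqs. (2) to (6) can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: the parameter dictionary, the helper names `sigmoid`, `init_params` and `lstm_step`, and the `coupled` flag are assumptions introduced here. Bias terms are omitted as in the equations, and `coupled=True` uses the shared-parameter variant $f_t = 1 - i_t$ described above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_params(n_in, n_hid, rng, scale=0.1):
    """Hypothetical parameter container; keys mirror the subscripts in Eqs. (2)-(6)."""
    p = {}
    for name in ("W_xi", "W_xf", "W_xc", "W_xo"):
        p[name] = scale * rng.standard_normal((n_hid, n_in))
    for name in ("W_hi", "W_hf", "W_hc", "W_ho", "W_ci", "W_cf", "W_co"):
        p[name] = scale * rng.standard_normal((n_hid, n_hid))
    return p

def lstm_step(x_t, h_prev, c_prev, p, coupled=True):
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["W_ci"] @ c_prev)      # Eq. (2)
    if coupled:
        f_t = 1.0 - i_t                                                           # f_t = 1 - i_t
    else:
        f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["W_cf"] @ c_prev)  # Eq. (3)
    c_t = f_t * c_prev + i_t * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev)      # Eq. (4)
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["W_co"] @ c_t)         # Eq. (5)
    h_t = o_t * np.tanh(c_t)                                                      # Eq. (6)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
params = init_params(n_in, n_hid, rng)
h_t, c_t = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):
    h_t, c_t = lstm_step(x_t, h_t, c_t, params)
print(h_t)
```

With the coupled gates, each element of $c_t$ is a convex combination of the past cell value and the new candidate, which is the linear dependence between $c_t$ and $c_{t-1}$ noted above.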

2.2 Stacked LSTMs 

Typically, LSTMs are stacked to form deep recurrent neural networks, as illustrated in the left panel
of Fig. 2.

The output from the lower layer LSTM at layer $L$ is $h_t^{(L)}$. With a possible affine transformation, this
output is used as the input $x_t^{(L+1)}$ to the upper layer LSTM at layer $L + 1$. Except for this output-input
connection, there are no other connections between the two layers.



3 The Depth-gated LSTM 


The depth-gated LSTM (DGLSTM)¹ is illustrated in the right panel of Fig. 2. It has a depth gate
that connects the memory cell $c_t^{(L+1)}$ in the upper layer $L + 1$ and the memory cell $c_t^{(L)}$ in the lower
layer $L$. The depth gate controls how much information flows from the lower memory cell directly to the upper
layer memory cell. The gate function at layer $L + 1$ at time $t$ is a logistic function:


$$d_t^{(L+1)} = \sigma\left(b_d^{(L+1)} + W_{xd}^{(L+1)} x_t^{(L+1)} + w_{cd}^{(L+1)} \odot c_{t-1}^{(L+1)} + w_{ld}^{(L+1)} \odot c_t^{(L)}\right) \tag{7}$$

where $b_d^{(L+1)}$ is a bias term. $W_{xd}^{(L+1)}$ is the weight matrix that relates the depth gate to the input of this
layer. The past memory cell is also related via a weight vector $w_{cd}^{(L+1)}$. To relate the lower layer
memory, it uses a weight vector $w_{ld}^{(L+1)}$. Note that, if the lower and upper layer memory cells have
different dimensions, $w_{ld}^{(L+1)}$ should be a matrix instead of a vector.




Figure 2: Illustration of the stacked LSTM and the depth-gated LSTM. Notice the additional
connection between memory cells in the lower and upper layers in the depth-gated LSTM.


Using the depth gate, a DGLSTM computes the memory cell at layer $L + 1$ as follows

$$c_t^{(L+1)} = d_t^{(L+1)} \odot c_t^{(L)} + f_t^{(L+1)} \odot c_{t-1}^{(L+1)} + i_t^{(L+1)} \odot \tanh\left(W_{xc}^{(L+1)} x_t^{(L+1)} + W_{hc}^{(L+1)} h_{t-1}^{(L+1)}\right) \tag{8}$$


In DGLSTM, Equations (2), (3), (5), and (6) are the same as in the standard LSTM, except that
DGLSTM uses a superscript $L + 1$ to denote operations at layer $L + 1$.
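
One DGLSTM step at layer $L + 1$, combining the depth gate of Eq. (7) with the cell update of Eq. (8), can be sketched as below. This is a hedged NumPy illustration rather than the released C++ implementation: the parameter names (including `w_ld` for the lower-layer weight vector), the helper functions, and the toy dimensions are assumptions. The lower and upper cells are assumed to have the same dimension, so the depth-gate weights on the cells are vectors applied element-wise.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_params(n_in, n_hid, rng, scale=0.1):
    """Hypothetical parameter container for one DGLSTM layer."""
    p = {}
    for name in ("W_xi", "W_xf", "W_xc", "W_xo", "W_xd"):
        p[name] = scale * rng.standard_normal((n_hid, n_in))
    for name in ("W_hi", "W_hf", "W_hc", "W_ho", "W_ci", "W_cf", "W_co"):
        p[name] = scale * rng.standard_normal((n_hid, n_hid))
    for name in ("w_cd", "w_ld", "b_d"):          # depth-gate vectors and bias
        p[name] = scale * rng.standard_normal(n_hid)
    return p

def dglstm_step(x_t, h_prev, c_prev, c_lower, p):
    """x_t: input to this layer; c_lower: lower-layer memory cell at the same time step."""
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["W_ci"] @ c_prev)  # Eq. (2)
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["W_cf"] @ c_prev)  # Eq. (3)
    d_t = sigmoid(p["b_d"] + p["W_xd"] @ x_t
                  + p["w_cd"] * c_prev + p["w_ld"] * c_lower)                 # Eq. (7)
    c_t = (d_t * c_lower + f_t * c_prev
           + i_t * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev))             # Eq. (8)
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["W_co"] @ c_t)     # Eq. (5)
    h_t = o_t * np.tanh(c_t)                                                  # Eq. (6)
    return h_t, c_t, d_t

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
params = init_params(n_in, n_hid, rng)
x_t = rng.standard_normal(n_in)
h_prev, c_prev = np.zeros(n_hid), np.zeros(n_hid)
c_lower = rng.standard_normal(n_hid)
h_t, c_t, d_t = dglstm_step(x_t, h_prev, c_prev, c_lower, params)
print(d_t)
```

When the depth gate is driven toward zero (for example by a large negative bias), the first term of Eq. (8) vanishes and the update reduces to the standard LSTM cell update of Eq. (4).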

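The idea of using gated linear dependence can also be used to connect the first layer memory cell
$c_t^{(1)}$ with the feature observation $x_t^{(0)}$. In this case, the depth gate is computed for $L = 0$ as follows

$$d_t^{(1)} = \sigma\left(b_d^{(1)} + W_{xd}^{(1)} x_t^{(1)} + w_{cd}^{(1)} \odot c_{t-1}^{(1)}\right), \tag{9}$$

and the memory cell is computed as

$$c_t^{(1)} = d_t^{(1)} \odot \left(W_{xd}^{(1)} x_t^{(0)}\right) + f_t^{(1)} \odot c_{t-1}^{(1)} + i_t^{(1)} \odot \tanh\left(W_{xc}^{(1)} x_t^{(0)} + W_{hc}^{(1)} h_{t-1}^{(1)}\right) \tag{10}$$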


¹ Implementation at https://github.com/kaishengyao/cnn/blob/master/cnn/dglstm.cc and
https://github.com/kaishengyao/cnn/blob/master/cnn/dglstm.h.

Table 1: BLEU scores on the BTEC Chinese-to-English machine translation task

Depth    GRU      LSTM     DGLSTM
3        33.35    32.43    34.48
5        32.73    33.52    33.81
10       30.72    31.99    32.19


Table 2: BLEU scores by reranking on the BTEC Chinese-to-English machine translation task

Dataset    Baseline    DGLSTM
Dev        2C61        30^5
Test       40.63       43.08


4 Experiments 

We applied DGLSTMs to two datasets. The first is the BTEC Chinese-to-English machine translation
task. Its training set consists of 44016 sentence pairs. We use its devset1 and devset2 for validation,
which in total have 1006 sentence pairs. We use its devset3 for testing, which has 506 sentence pairs.

The second dataset is the Penn Treebank (PTB) for language modeling. It consists of 42075 sentences
for training, 3371 sentences for development, and 3762 sentences for testing.

4.1 Machine translation results 

We conducted preliminary experiments and observed that the attention model [9] performed better
than the encoder-decoder method. We therefore applied the attention model [9] in our
experiments. Both the encoder and the decoder used recurrent neural networks in [9]. However, in this
experiment, we only used recurrent neural networks for the decoder. For the encoder, we used word embeddings
learned on the training set.

A preliminary experiment showed that the simple RNN [3] performed the worst. We therefore do not
include the simple RNN results in this paper. We compared DGLSTM with GRU and LSTM. All
these models used 200-dimensional hidden layers. We varied the depth of the RNNs. Results in Table 1
show that DGLSTM outperforms LSTM and GRU at all of the tested depths.

In another machine translation experiment, we used attention models with
DGLSTM to rescore test set k-best lists. We first trained two attention models on the training set, one with 3
layers of DGLSTMs and the other with 5 layers of DGLSTMs. Both used
50-dimensional hidden layers. We then trained a reranker model using the development data with
100-best lists for each translation pair. The 100-best lists were generated from the baseline.
The features for the reranker model are the scores from the attention models. The 100-best lists on
the test set were reranked using the trained reranker model. We ran the above reranking
process 10 times to obtain an averaged BLEU score, which was computed using one reference. The
BLEU scores are listed in Table 2. Compared to the baseline, DGLSTM improved the BLEU score by
about 2.5 points on the test set (from 40.63 to 43.08).
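
The note does not specify the form of the reranker, so the following is only a sketch under the assumption of a linear reranker: each hypothesis in a k-best list carries one feature score per attention model, and the reranker picks the hypothesis with the highest weighted sum. The function name `rerank`, the weights, and the toy scores are all hypothetical.

```python
from typing import List, Sequence, Tuple

# (translation text, feature scores); here one score per attention model.
Hypothesis = Tuple[str, Sequence[float]]

def rerank(kbest: List[Hypothesis], weights: Sequence[float]) -> str:
    """Return the hypothesis with the highest weighted sum of feature scores."""
    def score(features: Sequence[float]) -> float:
        return sum(w * f for w, f in zip(weights, features))
    best_text, _ = max(kbest, key=lambda hyp: score(hyp[1]))
    return best_text

# Toy 3-best list with made-up log-probability scores from two models.
kbest = [
    ("hypothesis a", (-4.2, -3.9)),
    ("hypothesis b", (-3.1, -3.5)),
    ("hypothesis c", (-3.8, -2.9)),
]
print(rerank(kbest, weights=(0.5, 0.5)))
```

In such a setup the weights would be fitted on the development 100-best lists and then applied unchanged to the test 100-best lists.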

4.2 Language modeling 

We conducted experiments on the PTB dataset. We trained a two-layer DGLSTM. Each layer has a
200-dimensional vector. Test set perplexity results are shown in Table 3. To our knowledge, DGLSTM
obtained the lowest perplexity on the PTB test set when compared against previously
published results.

Table 3: Penn Treebank test set results.

Model               Perplexity
(unrecoverable)     123
(unrecoverable)     117
(unrecoverable)     110
DOT(S)-RNN [12]     108
DGLSTM              96


5 Related works


We developed this method independently in a summer workshop and later learned of the works in [13, 14].
In highway networks [13], the output $y_t$ from a layer is a linear function of the input $x_t$, in addition
to the output from a nonlinear path. Both of them are gated as follows

$$y_t = H(x_t, W_{hh}) \odot T(x_t, W_{xT}) + x_t \odot C(x_t, W_c) \tag{11}$$

where $T$ and $C$ are called the transform gate and the carry gate, respectively. $H(\cdot)$ is the output from
a nonlinear path. $W_{hh}$, $W_{xT}$, and $W_c$ are matrices. Therefore, the highway network output has
a direct and linear connection, albeit gated, to the input. This allows highway networks to train
extremely deep networks easily.
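
A single highway layer following Eq. (11) can be sketched as below. The note does not fix the form of the nonlinear path, so this sketch assumes a tanh of an affine map for $H$ and logistic gates for $T$ and $C$; the function and variable names are illustrative. Biases are omitted for brevity.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def highway_layer(x, W_h, W_T, W_C):
    """Eq. (11): gated nonlinear path plus gated linear (carry) path."""
    h = np.tanh(W_h @ x)      # nonlinear path H (tanh is an assumption here)
    t = sigmoid(W_T @ x)      # transform gate T
    c = sigmoid(W_C @ x)      # carry gate C
    return h * t + x * c

rng = np.random.default_rng(0)
n = 4                         # input and output must share a dimension
x = rng.standard_normal(n)
W_h, W_T, W_C = (0.1 * rng.standard_normal((n, n)) for _ in range(3))
y = highway_layer(x, W_h, W_T, W_C)
print(y)
```

Highway networks are commonly used with the carry gate tied to the transform gate as $C = 1 - T$, which would replace the separate `W_C` above.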

DGLSTM is related to the highway networks in using the same idea of a linear and gated connection
to the input. It differs from the highway networks in not using a specific gate on the nonlinear path;
DGLSTM keeps the input and output gates, which are applied to the nonlinear transformations
in LSTMs. However, the overall effect of using the input and output gates may be similar to the
transform gate in the highway networks. An additional but important difference is that DGLSTM
linearly connects the memory cells in the lower and upper layers. Because of this, the memory
cell in DGLSTM has errors back-propagated both from the future and from the top layer, linearly
albeit gated. This might be the biggest difference from the highway networks [13] in their current
implementation.
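
To make the two linear error paths explicit, the DGLSTM cell update can be differentiated with respect to the lower-layer cell and the past cell. The following is a sketch in which $R_{\text{depth}}$ and $R_{\text{time}}$ are shorthand introduced here (not in the original note) for the remaining indirect paths that pass through the gate activations and the tanh input.

```latex
\begin{align*}
\frac{\partial c_t^{(L+1)}}{\partial c_t^{(L)}}
  &= \operatorname{diag}\!\left(d_t^{(L+1)}\right) + R_{\text{depth}}, \\
\frac{\partial c_t^{(L+1)}}{\partial c_{t-1}^{(L+1)}}
  &= \operatorname{diag}\!\left(f_t^{(L+1)}\right) + R_{\text{time}}.
\end{align*}
```

The diagonal terms are the gated linear paths: the forget gate carries the error across time, and the depth gate carries it across layers. In a standard stacked LSTM the cell of layer $L + 1$ depends on the cell of layer $L$ only through the output $h_t^{(L)}$, so the first derivative has no direct diagonal term.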

Perhaps the closest work to this research is Grid LSTM [14], which uses LSTMs in different
dimensions and connects them using gated linear connections. Because the dimensions can include not
only time, as in typical recurrent neural networks, but also depth and others, Grid LSTM is more
general than DGLSTM, which only considers time and depth. Also, Grid LSTM uses a generic form
of input, memory, and output. Doing so allows a memory cell to have a gated linear dependence not
only on its past memory cell but also on the past observations. Therefore, we consider DGLSTM as
a specific and simple case of Grid LSTM that has gates applied to time and depth only on memory
cells. However, DGLSTM, Grid LSTM and highway networks share the same idea of stacking
networks with both linear but gated connections and nonlinear paths. This idea can be applied to fully
connected, convolutional or recurrent layers.

6 Conclusions 

We have presented a depth-gated LSTM architecture, which uses a depth gate to create a gated linear
connection between lower and upper layer memory cells. We observed better performance using
this new architecture on machine translation and language modeling tasks. This architecture is
related to the highway networks [13] and Grid LSTM [14] in using an additional linear connection
with gates to regulate information flow across layers.